Building Kuaishou's Client Stability System

Article link: https://blog.csdn.net/Kwai_tech/article/details/107964806

Background

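When we talk about stability, we usually mean crashes. Android has Java crashes and native crashes; iOS has NSException, BSD signals, and Mach exceptions (EXC). The metrics commonly used in the industry are the session crash rate (crash count / launch count) and the device crash rate (crashed devices / total devices).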

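For these metrics the industry has several mature monitoring platforms, such as Bugly and Fabric, that offer a one-stop service from instrumentation and reporting to dashboards and alerting. There are also SDKs dedicated to capturing crashes, such as KSCrash and Breakpad, on top of which a monitoring platform can be customized as needed. With infrastructure this complete, the stability puzzle looks finished. Is it? Unfortunately not. This brings us to the first part of this article, the exit rate, starting with the following case.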

Exit Rate

Starting with Wakeups

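One day, several high-profile streamers reported frequent crashes on iOS that prevented them from going live. A failed broadcast by a top streamer degrades the experience of millions of users, which is far more serious than an ordinary crash. The engineers involved began investigating at once and found that not a single crash had been reported. The incident made us realize that our existing crash monitoring had a blind spot. Just as we were running out of options, our operations colleagues managed to obtain the system log from a streamer's device. The key line was: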
Wakeups: 45001 wakeups over the last 142 seconds (316 wakeups per second average), exceeding limit of 150 wakeups per second over 300 seconds

What does Wakeups mean here? Apple's official documentation^[1] explains it as follows:

Resource Limit [EXC_RESOURCE]

The process exceeded a resource consumption limit. This is a notification from the OS that the process is using too many resources. The exact resource is listed in the Exception Subtype field. If the Exception Note field contains NON-FATAL CONDITION, then the process was not killed even though a crash report was generated.

The exception subtype WAKEUPS indicates that threads in the process are being woken up too many times per second, which forces the CPU to wake up very often and consumes battery life.

Typically, this is caused by thread-to-thread communication (generally using peformSelector:onThread: or dispatch_async) that is unwittingly happening far more often than it should be. Because the sort of communication that triggers this exception is happening so frequently, there will usually be multiple background threads with very similar Backtraces - indicating where the communication is originating.

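In short: WAKEUPS is a subtype of the resource-limit exception (EXC_RESOURCE). It indicates that threads are being woken too frequently, which consumes CPU and increases power draw. When the threshold is exceeded and the condition is fatal, the process is killed. It typically occurs when threads communicate with each other at high frequency.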

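With the mechanism understood, the rest was straightforward. By analyzing the wakeup backtraces recorded in the system log, we traced the problem to fans sending the streamers private messages at a high rate, which caused high-frequency cross-thread communication and disk I/O. Both operations wake threads, and together they pushed wakeups past the threshold. After we optimized both operations to lower the wakeup frequency, the streamers could broadcast normally again. That seemed to settle the wakeups problem, but fixing only the immediate issue is not enough: what if users never report the problem, or refuse to upload their system logs? We need to be able to monitor wakeups in production, and to do that we have to read the source and understand how the operating system does it.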
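The article does not say how the two operations were optimized. One common way to cut cross-thread wakeups is to coalesce messages so that the consumer thread is woken once per batch rather than once per message. The following is a minimal single-threaded sketch of that idea, not the authors' fix; `batcher_t`, `BATCH_CAPACITY`, and the function names are all hypothetical, and the "wakeup" is only counted, not actually delivered to a thread.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_CAPACITY 64 /* hypothetical batch size, chosen for illustration */

/* Hypothetical coalescing buffer: messages accumulate here and the
 * consumer is "woken" once per flushed batch instead of once per message. */
typedef struct {
    int    pending[BATCH_CAPACITY];
    size_t count;     /* messages waiting in the current batch */
    size_t wakeups;   /* how many times the consumer would have been woken */
    size_t delivered; /* total messages handed to the consumer */
} batcher_t;

void batcher_init(batcher_t *b) {
    b->count = 0;
    b->wakeups = 0;
    b->delivered = 0;
}

/* Hand the pending batch to the consumer: one wakeup regardless of size. */
void batcher_flush(batcher_t *b) {
    if (b->count == 0) {
        return; /* nothing pending, so no wakeup is needed */
    }
    b->wakeups++;
    b->delivered += b->count;
    b->count = 0;
}

/* Queue one message; flush automatically when the batch is full. */
void batcher_push(batcher_t *b, int msg) {
    assert(b->count < BATCH_CAPACITY);
    b->pending[b->count++] = msg;
    if (b->count == BATCH_CAPACITY) {
        batcher_flush(b);
    }
}
```

Pushing 1000 messages through this buffer and flushing once at the end costs 16 wakeups instead of 1000. A real implementation would also flush on a timer so that a partial batch is not delayed indefinitely.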

How Wakeups Are Triggered

Figure 1

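Figure 1 is a flowchart of how the system monitors wakeups, summarized from reading the XNU source. task_ledgers is the "ledger" the kernel maintains for the current process; it records the usage of various system resources. When wakeups become too frequent, they are handled by task_wakeups_rate_exceeded, the callback registered in init_task_ledgers. If the warning argument is 1, wakeups have risen above the warning level, and telemetry is enabled to collect the stacks of the woken threads. If warning is 2, wakeups have dropped back below the warning level, and telemetry is disabled. If warning is 0, wakeups have exceeded the limit, and SENDING_NOTIFICATION__THIS_PROCESS_IS_CAUSING_TOO_MANY_WAKEUPS is called to raise EXC_RESOURCE; when the fatal condition is met, task_terminate_internal is called to terminate the process.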

void init_task_ledgers(void) {
	// ...
	// Register the wakeups callback
	ledger_set_callback(t, task_ledgers.interrupt_wakeups,
		task_wakeups_rate_exceeded, NULL, NULL);
}

The wakeups threshold is defined by the following constants:

#define TASK_WAKEUPS_MONITOR_DEFAULT_LIMIT		150 /* wakeups per second */
#define TASK_WAKEUPS_MONITOR_DEFAULT_INTERVAL	300 /* in seconds. */
/*
 * Level (in terms of percentage of the limit) at which the wakeups monitor triggers telemetry.
 *
 * (ie when the task's wakeups rate exceeds 70% of the limit, start taking user
 * stacktraces, aka micro-stackshots)
 */
#define TASK_WAKEUPS_MONITOR_DEFAULT_USTACKSHOTS_TRIGGER	70

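If the total number of wakeups within 300 seconds exceeds 45000 (300 * 150), the limit is considered exceeded. If the total exceeds 70% of that limit (31500), the warning level is considered exceeded and telemetry is enabled.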
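The arithmetic above can be checked against the log line from the incident. The sketch below reuses the three XNU constants quoted earlier; the helper functions are hypothetical names introduced here, and the strict "greater than" comparison is an assumption based on the wording "exceeds", not something taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constants copied from the XNU definitions quoted above. */
#define TASK_WAKEUPS_MONITOR_DEFAULT_LIMIT               150 /* wakeups per second */
#define TASK_WAKEUPS_MONITOR_DEFAULT_INTERVAL            300 /* in seconds */
#define TASK_WAKEUPS_MONITOR_DEFAULT_USTACKSHOTS_TRIGGER 70  /* percent of the limit */

/* Total wakeups allowed per interval: 150 * 300 = 45000. */
uint64_t wakeups_limit_total(void) {
    return (uint64_t)TASK_WAKEUPS_MONITOR_DEFAULT_LIMIT *
           TASK_WAKEUPS_MONITOR_DEFAULT_INTERVAL;
}

/* Warning level: 70% of the limit = 31500. */
uint64_t wakeups_warning_total(void) {
    return wakeups_limit_total() *
           TASK_WAKEUPS_MONITOR_DEFAULT_USTACKSHOTS_TRIGGER / 100;
}

/* Assumption: "exceeds" means strictly greater than. */
bool wakeups_over_limit(uint64_t wakeups) {
    return wakeups > wakeups_limit_total();
}

bool wakeups_over_warning(uint64_t wakeups) {
    return wakeups > wakeups_warning_total();
}

/* Average rate as a whole number of wakeups per second (truncated). */
uint64_t wakeups_per_second(uint64_t wakeups, uint64_t seconds) {
    assert(seconds > 0);
    return wakeups / seconds;
}
```

With the values from the log, `wakeups_over_limit(45001)` is true and `wakeups_per_second(45001, 142)` gives 316, matching the "316 wakeups per second average" reported. The process used up its entire 300-second budget in 142 seconds.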

/*
 * Types of warnings that trigger a callback.
 */
#define	LEDGER_WARNING_ROSE_ABOVE	1
#define	LEDGER_WARNING_DIPPED_BELOW	2

void task_wakeups_rate_exceeded(int warning, __unused const void *param0, __unused const void *param1) {

	if (warning == LEDGER_WARNING_ROSE_ABOVE) {
#if CONFIG_TELEMETRY
		/*
		 * This task is in danger of violating the wakeups monitor. Enable telemetry on this task
		 * so there are micro-stackshots available if and when EXC_RESOURCE is triggered.
		 */
		telemetry_task_ctl(current_task(), TF_WAKEMON_WARNING, 1);
#endif
		return;
	}

#if CONFIG_TELEMETRY
	/*
	 * If the balance has dipped below the warning level (LEDGER_WARNING_DIPPED_BELOW) or
	 * exceeded the limit, turn telemetry off for the task.
	 */
	telemetry_task_ctl(current_task(), TF_WAKEMON_WARNING, 0);
#endif

	if (warning == 0) {
		SENDING_NOTIFICATION__THIS_PROCESS_IS_CAUSING_TOO_MANY_WAKEUPS();
	}
}
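To make the three branches of this callback easier to follow outside the kernel, here is a small user-space simulation. It is only a sketch: `sim_task_t` and `sim_wakeups_rate_exceeded` are hypothetical stand-ins, the telemetry flag replaces the effect of telemetry_task_ctl, and a counter replaces the EXC_RESOURCE notification. Only the two warning constants are taken from the XNU excerpt above.

```c
#include <assert.h>
#include <stddef.h>

/* Warning codes as quoted from XNU above; 0 means the limit itself was exceeded. */
#define LEDGER_WARNING_ROSE_ABOVE   1
#define LEDGER_WARNING_DIPPED_BELOW 2

/* Hypothetical user-space stand-in for the kernel-side effects. */
typedef struct {
    int telemetry_enabled;  /* mirrors telemetry_task_ctl(..., 1) or (..., 0) */
    int notifications_sent; /* counts simulated EXC_RESOURCE notifications */
} sim_task_t;

/* Same branching as task_wakeups_rate_exceeded, applied to the simulated task. */
void sim_wakeups_rate_exceeded(sim_task_t *task, int warning) {
    assert(task != NULL);

    if (warning == LEDGER_WARNING_ROSE_ABOVE) {
        /* Above the warning level: start collecting stacks. */
        task->telemetry_enabled = 1;
        return;
    }

    /* Either dipped below the warning level or exceeded the limit:
     * telemetry is turned off in both cases. */
    task->telemetry_enabled = 0;

    if (warning == 0) {
        /* Limit exceeded: this is where the kernel raises EXC_RESOURCE. */
        task->notifications_sent++;
    }
}
```

Feeding the sequence 1, 2, 1, 0 into the simulation turns telemetry on, off, on again, and finally off together with one notification, which matches the flow the article describes for Figure 1.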

The core logic of SENDING_NOTIFICATION__THIS_PROCESS_IS_CAUSING_TOO_MANY_WAKEUPS is as follows:

	// Get the wakeup info
  ledger_get_entry_info(task->ledger, task_ledgers.</